Text classification using bi-directional similarity

ABSTRACT

A system for classifying text is provided. The system includes a data store containing a plurality of previously observed word sequences and a processor coupled to the data store. The processor is configured to receive a first word sequence and generate bi-directional similarity metrics based on the first word sequence and each of the previously observed word sequences. The processor is also configured to assign a classification to the first word sequence based on at least one of the bi-directional similarity metrics.

BACKGROUND

Textual classification is used in many contexts to ascribe one or more characteristics or categories to a set of text. The set of text may simply be a word, a paragraph, or an entire document or set of documents. Automatic textual classification is highly useful in that important information can be determined automatically about the text without requiring a user to read through the text first.

Automatic textual classification, in some contexts, may employ neural networks and/or a Naïve Bayes Classifier. Regardless, typical methods generally require significant computational overhead. In instances where vast text is generated, traditional methods of text classification may be too slow and/or beyond the reasonable capacity of the device on which the classification is performed.

The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.

SUMMARY

A system for classifying text is provided. The system includes a data store containing a plurality of previously observed word sequences and a processor coupled to the data store. The processor is configured to receive a first word sequence and generate bi-directional similarity metrics based on the first word sequence and each of the previously observed word sequences. The processor is also configured to assign a classification to the first word sequence based on at least one of the bi-directional similarity metrics.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the background.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagrammatic view of a computing system environment in which automatic textual classification is useful.

FIG. 2 is a diagrammatic view of a computer system for performing automatic textual classification in accordance with one embodiment.

FIG. 3 is a flow diagram of a method of classifying text using a bi-directional similarity metric in accordance with one embodiment.

FIG. 4 is a diagrammatic view of a computing environment deployed in a cloud computing architecture in accordance with an embodiment.

FIG. 5 is one example of a computing system in which embodiments can be deployed.

DETAILED DESCRIPTION

Embodiments described herein provide a highly efficient system for classifying text. Current approaches to classifying text involve either vectorizing a set of words and establishing a numerical distance between the vectors or finding metrics such as edit-distance between the set of words. The former suffers from information loss due to normalization while the latter is a non-numeric unidirectional operation that does not take into consideration the information present in both sets of words. Embodiments described below provide a bi-directional similarity metric between two sets of text or word sequences based on information in both word sequences.

FIG. 1 is a diagrammatic view of a computing system environment in which automatic textual classification is useful. Computer system 100 may be any suitable computer system that is used by one or more users 101 either locally or remotely, via a suitable connection. Computer system 100, in some examples, includes code that is authored and/or changed by one or more developers 104, who interact with system 100. Occasionally, a change in code 106 by developer 104 will generate an error in system 100. In order to identify and resolve the problem, user 101 may call a support center 109 or advisor 108, where advisor 108 will work with user 101 to identify various operations and contextual information about system 100 in order to resolve the problem. However, in some situations, advisor 108 may not be able to resolve the problem and will escalate the issue to an engineer. The engineer may interact with user 101 to obtain additional information; run custom tests; and/or use custom tools to identify the cause of the problem. This can be a long process that involves significant human interaction and patience.

In order to proactively identify problems before they affect users 101, a periodic analysis of all of the logs from system 100 may be performed. While such log analysis is useful in identifying potential problems before they affect users, the log analysis operation is very labor intensive. In order to facilitate the analysis of system 100 before users are affected, a listing of all exceptions and their associated exception data can be received from system 100 by exception processor 112, as indicated by arrow 114. When an error occurs, either the system or the currently executing application reports the error by throwing an exception containing information about the error. Once the exception is thrown, it is handled by the application or by a default exception handler. An exception generally includes significant information relative to the error. For example, the exception may include a text message to inform the user of the nature of the error and suggest action to resolve the problem; stack trace information; a help link; a source of the exception, as well as any additional information that may be relevant to the potential cause of the exception. Thus, exceptions can be rich collections of textual information that provide significant insight into system operation during an error.

Exception processor 112 may be a component of system 100 or may be separate therefrom. Exception processor 112 collects a list of all the exceptions that are possible and builds a hyperspace of exceptions, illustrated as points 114, 116, 118 in hyperspace 120. Then, periodically, such as daily, or in response to an event, such as the generation of a new exception, exception processor 112 will compare one or more unclassified exceptions with the known list of exceptions in order to classify the one or more unclassified exceptions. As can be appreciated, if a new exception is very similar to a previously seen exception, the new exception may be classified as related to the previously seen exception. Such similarity may allow a system engineer to assess whether remedial action appropriate for the previously seen exception may also be appropriate or at least similar to appropriate remedial action for the new exception. Conversely, if the new exception is not similar to any previously seen exceptions, then exception processor 112 can automatically escalate the exception to appropriate personnel since it may reflect an entirely new type of problem.

While embodiments described herein will be described in the context of analyzing exceptions, it should be understood that exceptions are simply one example of textual information that is amenable to the classification system and techniques described herein. Embodiments are applicable to classifying any textual information and are certainly not limited to exceptions.

FIG. 2 is a diagrammatic view of a computer system for performing automatic textual classification in accordance with one embodiment. System 100 includes one or more processor(s) 150, user interface (UI) component 152, network component 154, data store 156 and exception classifier 112. Processor(s) 150 may be any suitable processing element that is able to load and execute instructions in order to perform a computing function. For example, processor(s) 150 may be one or more individual cores in a microprocessor. However, processor(s) 150 can also be a vast array of distributed cores working on one or more related computing tasks. As such, the generic depiction illustrated in FIG. 2 is intended to encompass a significant variety of physical implementations ranging from small embedded computing devices to entire server clusters.

UI component 152, in some examples, is able to generate or otherwise facilitate interactions with one or more users in order to allow users to interact with system 100. UI component 152 may generate one or more dialogs or user interface displays to one or more users through any suitable mechanism, such as a local display of a computing device or via a web page using suitable data, such as HTML data.

System 100 includes or is coupled to network component 154, which allows system 100 to communicate with other devices through a suitable communication network, such as a local area network (LAN), a wide area network (WAN), such as the internet, or a combination thereof. In some examples, network component 154 may include a wired physical layer facilitating communication in accordance with the known Ethernet protocol. However, in other examples, network component 154 may include a wireless communication module(s) in addition to, or instead of, a wired physical layer. Regardless, network component 154 allows system 100 to communicate with one or more user devices 102 through network 160.

System 100 also includes or is coupled to data store 156, which may include a database or other suitable structure for storing a number of textual collections. Some of the textual collections stored within data store 156 may be training data that have already been analyzed and/or characterized. Additionally, data store 156 may include a number of textual collections (such as a log of exception data) for which classification is required.

Classifier 112 obtains a collection of text, such as an exception or other suitable grouping of text, and classifies the text by determining a bi-directional similarity metric for the collection of text as compared to one or more previously classified collections of text. Classifier 112 may also include or receive a similarity threshold such that if the similarity of the collection of text and one or more previously-classified collections of text is above the similarity threshold, then the collection of text may be assigned a classification. In one example, the collection of text includes exception information. However, embodiments are applicable to a variety of collections of text ranging from a couple of words or sentences to entire documents or collections of documents.

FIG. 3 is a flow diagram of a method of classifying text using a bi-directional similarity metric in accordance with one embodiment. Method 200 begins at block 202 where processor(s) 150 obtains a first word specimen or collection of text and a second word specimen or collection of previously classified text. Next, at block 204, pre-processing of one or both of the first and second word specimens is performed. In embodiments where previously classified text or word specimens are used, pre-processing the second word specimen need only be performed once. Thus, block 204, in some embodiments, may only pre-process the first word specimen. Pre-processing may include removing stop words. Stop words are a pre-defined set of words that are relatively common, yet do not appreciably add to the accuracy of classification. In one example, stop words may include words such as “on”, “which”, “the”, “at”, and “is.” This list can be tailored to the classification application as well. For example, in the context of exception classification, words or text that are common to all exceptions, for example, “exception” can be added to the list of stop words. Next, at block 208, punctuation is removed from the word specimens. In some embodiments, all punctuation is removed from the word specimens. At block 210, each of the first and second word specimens is alphabetized. Note, in some embodiments, duplicate words in a given specimen are retained such that the number of times that a given word occurs in the specimens affects the bi-directional similarity metric calculation.
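The pre-processing steps described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the function name `preprocess`, the `STOP_WORDS` set, and the lowercasing step are assumptions made for illustration. The sketch also strips punctuation before filtering stop words so that a token such as "is." is still recognized, whereas the text lists stop-word removal first.

```python
import string

# Hypothetical stop-word list: the five words named in the text plus "exception".
STOP_WORDS = frozenset({"on", "which", "the", "at", "is", "exception"})


def preprocess(text, stop_words=STOP_WORDS):
    """Turn raw text into an alphabetized word list, keeping duplicates."""
    # Delete every ASCII punctuation character (block 208).
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    # Lowercasing is an assumption of this sketch, not something the text requires.
    words = stripped.lower().split()
    # Drop stop words; repeated words are deliberately retained.
    kept = [word for word in words if word not in stop_words]
    # Alphabetize the specimen (block 210).
    return sorted(kept)


print(preprocess("The file is missing, at the path: the file!"))
# prints ['file', 'file', 'missing', 'path']
```

Because the word "file" occurs twice in the input, it appears twice in the result, which is how word frequency can carry through to the later similarity calculation.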

At block 212, a bi-directional similarity metric is calculated between the first and second word specimens. The similarity metric is bi-directional in the sense that if the text of one specimen is fully encompassed in the second specimen, but the second specimen contains additional text not found in the first specimen, then the metric will result in less than a perfect match. Only if both specimens match each other identically will the bi-directional similarity metric return a perfect result. In one example, the bi-directional similarity metric provides a probability that the word specimens or sequences are the same. More formally, P(Word Sequence 1 and 2 are same) = P(Word Sequence 1 is similar to Word Sequence 2) * P(Word Sequence 2 is similar to Word Sequence 1). The probability that Word Sequence 2 is similar to Word Sequence 1 is given by the total number of words in Word Sequence 2 that exist in Word Sequence 1 divided by the total number of words in Word Sequence 2. Similarly, the probability that Word Sequence 1 is similar to Word Sequence 2 is given by the total number of words in Word Sequence 1 that exist in Word Sequence 2 divided by the total number of words in Word Sequence 1. As set forth above, these two probabilities are combined, such as by multiplying them together, in order to provide the bi-directional similarity metric. However, embodiments also include applying weighting factors such that one direction is favored more than the other.
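As a worked illustration of the calculation at block 212, the following Python sketch computes the two directional ratios and multiplies them. The function names are hypothetical, returning 0.0 for an empty sequence is an assumption, and the use of exponents as weighting factors is only one possible reading, since the text does not specify how weights are applied.

```python
def directional_similarity(source, target):
    """Fraction of the words in `source` that also exist in `target`."""
    if not source:
        # Assumption: an empty sequence is treated as similar to nothing.
        return 0.0
    target_words = set(target)
    matches = sum(1 for word in source if word in target_words)
    return matches / len(source)


def bidirectional_similarity(seq1, seq2, weight_1_to_2=1.0, weight_2_to_1=1.0):
    """Product of both directional ratios; exponents act as optional weights."""
    p_1_to_2 = directional_similarity(seq1, seq2)
    p_2_to_1 = directional_similarity(seq2, seq1)
    return (p_1_to_2 ** weight_1_to_2) * (p_2_to_1 ** weight_2_to_1)


first = ["file", "missing", "path"]
second = ["access", "denied", "file", "missing", "path"]
# All 3 words of `first` exist in `second` (3/3), but only 3 of the
# 5 words of `second` exist in `first` (3/5), so the product is about 0.6.
print(bidirectional_similarity(first, second))
```

Here the first sequence is fully contained in the second, yet the metric is less than 1 because the second sequence has two extra words. With equal weights the result is the same whichever sequence is passed first.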

At block 214, the bi-directional similarity metric determined at block 212 is compared with a pre-defined threshold in order to determine whether to apply a classification to the first word sequence or specimen. If the bi-directional similarity metric is above the pre-defined threshold, then the classification is applied to the first word sequence, as indicated at block 216. Conversely, if the bi-directional similarity metric is not above the pre-defined threshold, then the first word sequence is not classified, and control passes to block 218, where control may return to block 202 via dashed line 220 to compare the first word sequence to another word sequence. In this way, the first word sequence will generally be classified based on its nearest neighbor in the collection.
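The threshold comparison and loop described above might be sketched as follows. This is an illustrative assumption rather than the disclosed method: the sketch scores every previously observed sequence and keeps the best match (a nearest-neighbor reading), and the names `classify`, `known`, and the default threshold of 0.5 are hypothetical.

```python
def bidirectional_similarity(seq1, seq2):
    """Product of the two directional word-overlap ratios."""
    if not seq1 or not seq2:
        return 0.0
    words1, words2 = set(seq1), set(seq2)
    p_1_to_2 = sum(1 for word in seq1 if word in words2) / len(seq1)
    p_2_to_1 = sum(1 for word in seq2 if word in words1) / len(seq2)
    return p_1_to_2 * p_2_to_1


def classify(new_words, known, threshold=0.5):
    """Return (label, score) of the best match, or (None, score) if too weak.

    `known` is a list of (label, word_list) pairs of previously
    classified word sequences.
    """
    best_label, best_score = None, 0.0
    for label, words in known:
        score = bidirectional_similarity(new_words, words)
        if score > best_score:
            best_label, best_score = label, score
    if best_score > threshold:
        return best_label, best_score
    # Nothing is similar enough; a caller could escalate this case.
    return None, best_score


known = [
    ("disk_full", ["disk", "full", "write"]),
    ("file_missing", ["file", "missing", "path"]),
]
print(classify(["file", "missing", "path", "read"], known))
# prints ('file_missing', 0.75)
print(classify(["network", "timeout"], known))
# prints (None, 0.0)
```

A result of `None` corresponds to a word sequence that matches no previously seen sequence closely enough to be classified.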

The pre-defined threshold to cut off classification can be learned or otherwise determined using training data, and one or more binary classifiers can be trained to match a document/exception with various stack traces/word sequences. While training the classifier, the bias-variance tradeoff is incorporated through the threshold value. As can be appreciated, selection of the threshold value will determine the cluster density. For example, lower thresholds will result in broader clusters while higher thresholds will result in tighter clusters.
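The text does not say how the threshold is learned. One simple possibility, offered only as a sketch under stated assumptions, is to sweep candidate thresholds over labeled training pairs and keep the one with the highest accuracy. The function `choose_threshold`, the candidate grid, and the sample scores below are all hypothetical.

```python
def choose_threshold(scored_pairs, candidates=None):
    """Pick the candidate threshold with the highest training accuracy.

    `scored_pairs` is a list of (similarity_score, is_match) tuples, where
    is_match says whether the two sequences truly share a classification.
    """
    if candidates is None:
        # Assumed grid: 0.00, 0.05, ..., 0.95.
        candidates = [step / 20 for step in range(20)]
    best_threshold, best_accuracy = candidates[0], -1.0
    for threshold in candidates:
        # A pair is predicted to match when its score is above the threshold.
        correct = sum(
            1 for score, is_match in scored_pairs
            if (score > threshold) == is_match
        )
        accuracy = correct / len(scored_pairs)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy


# Made-up training scores, for illustration only.
training = [
    (0.90, True), (0.72, True), (0.58, True),
    (0.42, False), (0.20, False), (0.10, False),
]
threshold, accuracy = choose_threshold(training)
print(threshold, accuracy)
```

On ties the sketch keeps the lowest qualifying threshold, which favors broader clusters; preferring the highest would favor tighter ones.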

Embodiments described herein are able to quickly utilize previously seen examples of text in order to classify new sets of text. However, embodiments can also be used to dynamically generate clusters of text. For example, a dynamic cluster can be started with a null set and will involve either the addition of a word sequence or incrementing an existing sequence's counter value depending on the threshold probability of a match between the two word sequences. As described above, a lower threshold will result in broader clusters while a higher threshold will result in tighter clusters.
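Dynamic clustering of this kind can be sketched in a few lines of Python. The sketch is an assumption-laden illustration: the dictionary layout, the function name `update_clusters`, the default threshold, and the choice to increment the first cluster that exceeds the threshold are not taken from the text.

```python
def bidirectional_similarity(seq1, seq2):
    """Product of the two directional word-overlap ratios."""
    if not seq1 or not seq2:
        return 0.0
    words1, words2 = set(seq1), set(seq2)
    p_1_to_2 = sum(1 for word in seq1 if word in words2) / len(seq1)
    p_2_to_1 = sum(1 for word in seq2 if word in words1) / len(seq2)
    return p_1_to_2 * p_2_to_1


def update_clusters(clusters, words, threshold=0.5):
    """Increment a matching cluster's counter or add a new cluster."""
    for cluster in clusters:
        if bidirectional_similarity(words, cluster["words"]) > threshold:
            cluster["count"] += 1
            return cluster
    # No existing sequence matched: the new sequence starts its own cluster.
    new_cluster = {"words": list(words), "count": 1}
    clusters.append(new_cluster)
    return new_cluster


clusters = []  # start from a null set
for sequence in (
    ["file", "missing", "path"],
    ["file", "missing", "path", "read"],
    ["network", "timeout"],
):
    update_clusters(clusters, sequence)

print([(cluster["words"], cluster["count"]) for cluster in clusters])
# prints [(['file', 'missing', 'path'], 2), (['network', 'timeout'], 1)]
```

Raising `threshold` above 0.75 in this example would place the second sequence in its own cluster, which reflects the tighter clusters that higher thresholds produce.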

The present discussion has mentioned processors and servers. In one embodiment, the processors and servers include computer processors with associated memory and timing circuitry, not separately shown. They are functional parts of the systems or devices to which they belong and are activated by, and facilitate the functionality of the other components or items in those systems.

Also, embodiments described herein may employ a variety of user interface displays. Such user interface displays may have different forms and a wide variety of different user actuatable input mechanisms disposed thereon. For instance, the user actuatable input mechanisms can be text boxes, check boxes, icons, links, drop-down menus, search boxes, etc. They can also be actuated in a wide variety of different ways. For instance, they can be actuated using a point and click device (such as a track ball or mouse). They can be actuated using hardware buttons, switches, a joystick or keyboard, thumb switches or thumb pads, etc. They can also be actuated using a virtual keyboard or other virtual actuators. In addition, where the screen on which they are displayed is a touch sensitive screen, they can be actuated using touch gestures. Also, where the device that displays them has speech recognition components, they can be actuated using speech commands.

A number of data stores have also been discussed. It will be noted such data stores can each be broken into multiple data stores. All can be local to the systems accessing them, all can be remote, or some can be local while others are remote. All of these configurations are contemplated herein.

Also, the figures show a number of blocks with functionality ascribed to each block. It will be noted that fewer blocks can be used so the functionality is performed by fewer components. Also, more blocks can be used with the functionality distributed among more components.

FIG. 4 is a block diagram of computing system 100, shown in FIG. 2, except that its elements are disposed in a cloud computing architecture 500. Cloud 502 is composed of at least one server computer, but may also include other interconnected devices, computers or systems. Cloud computing provides computation, software, data access, and storage services that do not require end-user knowledge of the physical location or configuration of the system that delivers the services. In various embodiments, cloud computing delivers the services over a wide area network, such as the internet, using appropriate protocols. For instance, cloud computing providers deliver applications over a wide area network and they can be accessed through a web browser or any other computing component. Software or components of computing system 100, as well as the corresponding data, can be stored on servers at a remote location. The computing resources in a cloud computing environment can be consolidated at a remote data center location or they can be dispersed. Cloud computing infrastructures can deliver services through shared data centers, even though they appear as a single point of access for the user. Thus, the components and functions described herein can be provided from a service provider at a remote location using a cloud computing architecture. Alternatively, they can be provided from a conventional server, or they can be installed on client devices directly, or in other ways.

The description is intended to include both public cloud computing and private cloud computing. Cloud computing (both public and private) provides substantially seamless pooling of resources, as well as a reduced need to manage and configure underlying hardware infrastructure.

A public cloud is managed by a vendor and typically supports multiple consumers using the same infrastructure. Also, a public cloud, as opposed to a private cloud, can free up the end users from managing the hardware. A private cloud may be managed by the organization itself and the infrastructure is typically not shared with other organizations. The organization still maintains the hardware to some extent, such as installations and repairs, etc.

In the embodiment shown in FIG. 4, some items are similar to those shown in FIG. 2 and they are similarly numbered. FIG. 4 specifically shows that computing system 100 is located in cloud 502 (which can be public, private, or a combination where portions are public while others are private). FIG. 4 shows that it is also contemplated that some elements of computing system 100 are disposed in cloud 502 while others are not. By way of example, data store 156 can be disposed outside of cloud 502, and accessed through cloud 502. Regardless of where they are located, they can be accessed directly by user 102, through a network (either a wide area network or a local area network), they can be hosted at a remote site by a service, or they can be provided as a service through a cloud or accessed by a connection service that resides in the cloud. All of these architectures are contemplated herein.

It will also be noted that computing system 100, or portions of it, can be disposed on a wide variety of different devices. Some of those devices include servers, desktop computers, laptop computers, tablet computers, or other mobile devices, such as palm top computers, cell phones, smart phones, multimedia players, personal digital assistants, et cetera.

FIG. 5 is one embodiment of a computing system in which embodiments can be deployed. With reference to FIG. 5, an exemplary system for implementing some embodiments includes a computer 810. Components of computer 810 may include, but are not limited to, a processing unit 820 (which can comprise processor(s) 150), system memory 830, and a system bus 821 that couples various system components including the system memory to the processing unit 820. The system bus 821 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.

Computer 810 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 810 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media is different from, and does not include, a modulated data signal or carrier wave. It includes hardware storage media including both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 810. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.

The system memory 830 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 831 and random access memory (RAM) 832. A basic input/output system 833 (BIOS), containing the basic routines that help to transfer information between elements within computer 810, such as during start-up, is typically stored in ROM 831. RAM 832 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 820. By way of example, and not limitation, FIG. 5 illustrates operating system 834, application programs 835, other program modules 836, and program data 837.

The computer 810 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only, FIG. 5 illustrates a hard disk drive 841 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 851 that reads from or writes to a removable, nonvolatile magnetic disk 852, and an optical disk drive 855 that reads from or writes to a removable, nonvolatile optical disk 856 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 841 is typically connected to the system bus 821 through a non-removable memory interface such as interface 840, and magnetic disk drive 851 and optical disk drive 855 are typically connected to the system bus 821 by a removable memory interface, such as interface 850.

Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.

The drives and their associated computer storage media discussed above and illustrated in FIG. 5 provide storage of computer readable instructions, data structures, program modules and other data for the computer 810. In FIG. 5, for example, hard disk drive 841 is illustrated as storing operating system 844, application programs 845, other program modules 846, and program data 847. Note that these components can either be the same as or different from operating system 834, application programs 835, other program modules 836, and program data 837. Operating system 844, application programs 845, other program modules 846, and program data 847 are given different numbers here to illustrate that, at a minimum, they are different copies.

A user may enter commands and information into the computer 810 through input devices such as a keyboard 862, a microphone 863, and a pointing device 861, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 820 through a user input interface 860 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A visual display 891 or other type of display device is also connected to the system bus 821 via an interface, such as a video interface 890. In addition to the monitor, computers may also include other peripheral output devices such as speakers 897 and printer 896, which may be connected through an output peripheral interface 895.

The computer 810 is operated in a networked environment using logical connections to one or more remote computers, such as a remote computer 880. The remote computer 880 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 810. The logical connections depicted in FIG. 5 include a local area network (LAN) 871 and a wide area network (WAN) 873, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

When used in a LAN networking environment, the computer 810 is connected to the LAN 871 through a network interface or adapter 870. When used in a WAN networking environment, the computer 810 typically includes a modem 872 or other means for establishing communications over the WAN 873, such as the Internet. The modem 872, which may be internal or external, may be connected to the system bus 821 via the user input interface 860, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 810, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 5 illustrates remote application programs 885 as residing on remote computer 880. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

Embodiments described herein allow a computer system to perform classification of text very quickly with relatively little computational overhead. The classification is based on a bi-directional similarity metric that employs information from both word sequences in an efficient operation. Thus, the computer is able to perform the textual classification faster than would otherwise be possible. Additionally, in embodiments where classification is performed on exception information, or other computer-generated textual information, more significant exceptions can be automatically classified and surfaced for remedial action. This reduces, or even eliminates, the time that technicians or operators would otherwise spend reading through all of the exception information first.

It should also be noted that the different embodiments described herein can be combined in different ways. That is, parts of one or more embodiments can be combined with parts of one or more other embodiments. All of this is contemplated herein. Various examples are set forth below.

Example 1 is a system for classifying text. The system includes a data store containing a plurality of previously observed word sequences and a processor coupled to the data store. The processor is configured to receive a first word sequence and generate bi-directional similarity metrics based on the first word sequence and each of the previously observed word sequences. The processor is also configured to assign a classification to the first word sequence based on at least one of the bi-directional similarity metrics.

Example 2 is a system for classifying text of any or all of the previous examples, wherein a respective bi-directional similarity metric is based on a number of words of the first word sequence that are present in a respective one of the plurality of previously observed word sequences as well as the number of words in the respective one of the plurality of previously observed word sequences that are present in the first word sequence.

Example 3 is a system for classifying text of any or all of the previous examples, wherein each respective bi-directional similarity metric is based on a probability of similarity that the first word sequence is similar to a respective one of the plurality of previously observed word sequences in combination with a probability of similarity that the respective one of the plurality of previously observed word sequences is similar to the first word sequence.

Example 4 is a system for classifying text of any or all of the previousexamples, wherein the probability of similarity that the first wordsequence is similar to a respective one of the plurality of previouslyobserved word sequences is based on a ratio of a total number of wordsof the first word sequence that are present in the respective one of theplurality of previously observed word sequences to the total number ofwords in the respective one of the plurality of previously observed wordsequences.

Example 5 is a system for classifying text of any or all of the previous examples, wherein the probability of similarity that a respective one of the plurality of previously observed word sequences is similar to the first word sequence is based on a ratio of a total number of words of the respective one of the plurality of previously observed word sequences that are present in the first word sequence to the total number of words in the first word sequence.

Example 6 is a system for classifying text of any or all of the previous examples, wherein the bi-directional similarity metric is the product of the probability of similarity that the first word sequence is similar to a respective one of the plurality of previously observed word sequences and the probability of similarity that the respective one of the plurality of previously observed word sequences is similar to the first word sequence.

Example 7 is a system for classifying text of any or all of the previous examples, wherein the product includes equal weights for each of the probabilities.
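
The ratios and product recited in Examples 4 through 7 can be illustrated with a minimal Python sketch. It is offered only as one possible reading of those examples, not as the implementation; the function names `directional_ratio` and `bidirectional_similarity` are hypothetical, word sequences are assumed to be lists of already-tokenized words, and empty sequences are assumed to yield a metric of zero.

```python
def directional_ratio(source_words, target_words, denominator):
    """Count the words of source_words that appear in target_words,
    then divide by the supplied denominator (0.0 if it is zero)."""
    target_set = set(target_words)
    matches = sum(1 for word in source_words if word in target_set)
    return matches / denominator if denominator else 0.0


def bidirectional_similarity(first, observed):
    """Unweighted product of the two directional ratios (Examples 6 and 7)."""
    # Example 4: words of the first sequence found in the observed
    # sequence, over the total number of words in the observed sequence.
    p_first_to_observed = directional_ratio(first, observed, len(observed))
    # Example 5: words of the observed sequence found in the first
    # sequence, over the total number of words in the first sequence.
    p_observed_to_first = directional_ratio(observed, first, len(first))
    # Note: with repeated words, these ratios are not clamped to 1 here.
    return p_first_to_observed * p_observed_to_first


first = ["null", "reference", "exception", "in", "parser"]
observed = ["null", "reference", "exception"]
print(bidirectional_similarity(first, observed))  # 0.6
```

In this usage, all three observed words appear in the first sequence, giving ratios of 3/3 and 3/5, so the product is 0.6.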

Example 8 is a system for classifying text of any or all of the previous examples, wherein the probability of similarity that a respective one of the plurality of previously observed word sequences is similar to the first word sequence is based on a ratio of a total number of words of the respective one of the plurality of previously observed word sequences that are present in the first word sequence to the total number of words in the first word sequence.

Example 9 is a system for classifying text of any or all of the previous examples, wherein the processor is configured to perform pre-processing of the first word sequence before determining the bi-directional similarity metrics.

Example 10 is a system for classifying text of any or all of the previous examples, wherein the pre-processing includes removing stop words.

Example 11 is a system for classifying text of any or all of the previous examples, wherein pre-processing includes maintaining multiple occurrences of the same word.

Example 12 is a system for classifying text of any or all of the previous examples, wherein pre-processing includes alphabetizing the first word sequence.

Example 13 is a system for classifying text of any or all of the previous examples, wherein the plurality of previously observed word sequences are pre-processed.
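
The pre-processing described in Examples 9 through 13 might be sketched as follows. This is an illustration under stated assumptions rather than the described system itself: the `preprocess` function name and the `STOP_WORDS` list are hypothetical, and the lowercasing and whitespace tokenization are choices made only for this sketch.

```python
# Hypothetical, deliberately small stop-word list for illustration only.
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "at", "to", "is"})


def preprocess(text):
    """Return the words of text with stop words removed, repeated words
    kept, and the result sorted alphabetically."""
    # Assumption: lowercase and split on whitespace to obtain words.
    words = text.lower().split()
    # Remove stop words (Example 10) while keeping every occurrence of
    # the remaining words (Example 11).
    kept = [word for word in words if word not in STOP_WORDS]
    # Alphabetize the sequence (Example 12).
    return sorted(kept)


print(preprocess("The index is out of range in the index table"))
# ['index', 'index', 'out', 'range', 'table']
```

The same function could be applied to each previously observed word sequence before it is stored, so that both sides of a comparison are normalized the same way.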

Example 14 is a system for classifying text of any or all of the previous examples, wherein the first word sequence is computer-generated.

Example 15 is a system for classifying text of any or all of the previous examples, wherein the computer-generated first word sequence includes exception information.

Example 16 is a system for classifying text of any or all of the previous examples, wherein the processor is configured to selectively apply the classification if at least one of the bi-directional similarity metrics exceeds a pre-defined threshold.

Example 17 is a computer-implemented method for classifying computer-generated text. The method includes pre-processing the computer-generated text and at least one previously observed exception. A first probability of similarity of the computer-generated text to the at least one previously observed exception is determined. A second probability of similarity of the at least one previously observed exception to the computer-generated text is determined. A bi-directional similarity metric is generated based on the first and second probabilities. The computer-generated text is selectively classified if the bi-directional similarity metric exceeds a pre-defined threshold.

Example 18 is a computer-implemented method of any or all of the previous examples, wherein the computer-generated text is exception information.
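
The selective classification of Examples 16 through 18 could look roughly like the sketch below, which assigns the label of the best-matching previously observed exception only when its metric exceeds a threshold. Everything named here is an assumption made for illustration: the `classify` and `overlap_product` functions, the pairing of each observed word list with a label, the choice of the single highest-scoring match, and the threshold value of 0.5. The inputs are assumed to be pre-processed word lists.

```python
def overlap_product(first, observed):
    """Stand-in bi-directional metric: product of the two overlap ratios."""
    if not first or not observed:
        return 0.0
    first_set, observed_set = set(first), set(observed)
    first_hits = sum(1 for word in first if word in observed_set)
    observed_hits = sum(1 for word in observed if word in first_set)
    return (first_hits / len(observed)) * (observed_hits / len(first))


def classify(text_words, observed_exceptions, threshold=0.5):
    """Return the label of the best match, or None if no metric
    exceeds the (hypothetical) pre-defined threshold."""
    best_label, best_score = None, 0.0
    for words, label in observed_exceptions:
        score = overlap_product(text_words, words)
        if score > best_score:
            best_label, best_score = label, score
    # Selectively classify: only apply the label above the threshold.
    return best_label if best_score > threshold else None


observed_exceptions = [
    (["index", "out", "range"], "IndexOutOfRange"),
    (["null", "reference"], "NullReference"),
]
print(classify(["index", "out", "range", "table"], observed_exceptions))
# IndexOutOfRange
```

Text that shares no words with any observed exception produces a metric of zero and is left unclassified (`None`) in this sketch.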

Example 19 is a computer-implemented method of comparing a first set of text to a second set of text. The method includes determining a first probability of similarity of the first set of text to the second set of text and determining a second probability of similarity of the second set of text to the first set of text. A bi-directional similarity metric is generated based on the first and second probabilities. The first set of text is classified based on the bi-directional similarity metric and the second set of text.

Example 20 is a computer-implemented method of any or all of the previous examples, wherein the first probability is based on a total number of words in the first set of text that are present in the second set of text divided by the total number of words in the first set of text; and the second probability is based on a total number of words in the second set of text that are present in the first set of text divided by the total number of words in the second set of text.
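
Example 20 divides each overlap count by the length of the text from which the words were counted, which differs from the denominators recited in Examples 4 and 5. A minimal sketch of just these two probabilities follows; the function name `example_20_probabilities` is hypothetical, the inputs are assumed to be lists of words, and an empty text is assumed to give a probability of zero.

```python
def example_20_probabilities(first_words, second_words):
    """Return (first probability, second probability) as read from Example 20."""
    first_set, second_set = set(first_words), set(second_words)
    # Words in the first text that are present in the second text.
    first_hits = sum(1 for word in first_words if word in second_set)
    # Words in the second text that are present in the first text.
    second_hits = sum(1 for word in second_words if word in first_set)
    # Each count is divided by the length of its own text.
    p_first = first_hits / len(first_words) if first_words else 0.0
    p_second = second_hits / len(second_words) if second_words else 0.0
    return p_first, p_second


p_first, p_second = example_20_probabilities(
    ["disk", "full", "error"], ["disk", "error"]
)
print(p_first, p_second)
```

Here two of the three words in the first text appear in the second text, and both words of the second text appear in the first, so the probabilities are 2/3 and 1.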

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

What is claimed is:
 1. A system for classifying text, the system comprising: a data store containing a plurality of previously observed word sequences; a processor coupled to the data store and configured to receive a first word sequence and generate bi-directional similarity metrics based on the first word sequence and each of the previously observed word sequences; and wherein the processor is configured to assign a classification to the first word sequence based on at least one of the bi-directional similarity metrics.
 2. The system of claim 1, wherein a respective bi-directional similarity metric is based on a number of words of the first word sequence that are present in a respective one of the plurality of previously observed word sequences as well as the number of words in the respective one of the plurality of previously observed word sequences that are present in the first word sequence.
 3. The system of claim 1, wherein each respective bi-directional similarity metric is based on a probability of similarity that the first word sequence is similar to a respective one of the plurality of previously observed word sequences in combination with a probability of similarity that the respective one of the plurality of previously observed word sequences is similar to the first word sequence.
 4. The system of claim 3, wherein the probability of similarity that the first word sequence is similar to a respective one of the plurality of previously observed word sequences is based on a ratio of a total number of words of the first word sequence that are present in the respective one of the plurality of previously observed word sequences to the total number of words in the respective one of the plurality of previously observed word sequences.
 5. The system of claim 4, wherein the probability of similarity that a respective one of the plurality of previously observed word sequences is similar to the first word sequence is based on a ratio of a total number of words of the respective one of the plurality of previously observed word sequences that are present in the first word sequence to the total number of words in the first word sequence.
 6. The system of claim 3, wherein the bi-directional similarity metric is the product of the probability of similarity that the first word sequence is similar to a respective one of the plurality of previously observed word sequences and the probability of similarity that the respective one of the plurality of previously observed word sequences is similar to the first word sequence.
 7. The system of claim 6, wherein the product includes equal weights for each of the probabilities.
 8. The system of claim 3, wherein the probability of similarity that a respective one of the plurality of previously observed word sequences is similar to the first word sequence is based on a ratio of a total number of words of the respective one of the plurality of previously observed word sequences that are present in the first word sequence to the total number of words in the first word sequence.
 9. The system of claim 1, wherein the processor is configured to perform pre-processing of the first word sequence before determining the bi-directional similarity metrics.
 10. The system of claim 9, wherein the pre-processing includes removing stop words.
 11. The system of claim 9, wherein pre-processing includes maintaining multiple occurrences of the same word.
 12. The system of claim 9, wherein pre-processing includes alphabetizing the first word sequence.
 13. The system of claim 9, wherein the plurality of previously observed word sequences are pre-processed.
 14. The system of claim 1, wherein the first word sequence is computer-generated.
 15. The system of claim 14, wherein the computer-generated first word sequence includes exception information.
 16. The system of claim 1, wherein the processor is configured to selectively apply the classification if at least one of the bi-directional similarity metrics exceeds a pre-defined threshold.
 17. A computer-implemented method for classifying computer-generated text, the method comprising: pre-processing the computer-generated text and at least one previously observed exception; determining a first probability of similarity of the computer-generated text to the at least one previously observed exception; determining a second probability of similarity of the at least one previously observed exception to the computer-generated text; and generating a bi-directional similarity metric based on the first and second probabilities; and selectively classifying the computer-generated text if the bi-directional similarity metric exceeds a pre-defined threshold.
 18. The computer-implemented method of claim 17, wherein the computer-generated text is exception information.
 19. A computer-implemented method of comparing a first set of text to a second set of text, the method comprising: determining a first probability of similarity of the first set of text to the second set of text; determining a second probability of similarity of the second set of text to the first set of text; and generating a bi-directional similarity metric based on the first and second probabilities; and classifying the first set of text based on the bi-directional similarity metric and the second set of text.
 20. The computer-implemented method of claim 19, wherein: the first probability is based on a total number of words in the first set of text that are present in the second set of text divided by the total number of words in the first set of text; and the second probability is based on a total number of words in the second set of text that are present in the first set of text divided by the total number of words in the second set of text.